Unifying Vision-Language Representation Space with Single-Tower Transformer
Authors
Abstract
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and its caption can be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We discover intriguing properties that distinguish OneR from previous works that have modality-specific representation spaces, such as zero-shot localization, text-guided visual reasoning and multi-modal retrieval, and present analyses to provide insights into this new form of multi-modal representation learning. Thorough evaluations demonstrate the potential of a unified modality-agnostic VLP framework.
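The abstract does not spell out the training objective, so the following is only a minimal sketch of a generic symmetric InfoNCE contrastive loss of the kind commonly used for image-caption pairs, not the OneR objective itself. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for illustration; in a single-tower setting both sets of embeddings would come from the same encoder, which is not modelled here.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/caption embeddings.

    Pair i (image_emb[i], text_emb[i]) is the positive; every other
    combination in the batch acts as a negative. The temperature is an
    assumed value, not one taken from the paper.
    """
    img = [l2_normalize(v) for v in image_emb]
    txt = [l2_normalize(v) for v in text_emb]
    # logits[i][j]: temperature-scaled cosine similarity of image i and caption j
    logits = [[sum(a * b for a, b in zip(u, w)) / temperature for w in txt]
              for u in img]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch: matched pairs give a low loss, swapped pairs a high one.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With the toy inputs, the loss for correctly matched pairs is far smaller than for swapped pairs, which is the behaviour that pulls an image and its caption toward the same point in a shared representation space.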
Similar resources
Feature Space Trajectory Representation for Active Vision
A new feature space trajectory (FST) description of 3-D distorted views of an object is advanced for active vision applications. In an FST, different distorted object views are vertices in feature space. A new eigen-feature space and Fourier transform features are used. Vertices for different adjacent distorted views are connected by straight lines so that an FST is created as the viewpoint chang...
Unifying Low-Level Vision
This white paper supports the goal of establishing Computer Vision as a coherent intellectual discipline by suggesting a specific agenda for the unification of many low-level vision principles, algorithms, and data structures. Our goal is to identify a set of highly related low-level vision problems, define their common structure, and establish a coherent intellectual discipline around the shar...
Constant Space Complexity Environment Representation for Vision-based Navigation
This paper presents a preliminary conceptual investigation into an environment representation that has constant space complexity with respect to the camera image space. This type of representation allows the planning algorithms of a mobile agent to bypass what are often complex and noisy transformations between camera image space and Euclidean space. The approach is to compute per-pixel potenti...
Unifying Class-Based Representation Formalisms
The notion of class is ubiquitous in computer science and is central in many formalisms for the representation of structured knowledge used both in knowledge representation and in databases. In this paper we study the basic issues underlying such representation formalisms and single out both their common characteristics and their distinguishing features. Such investigation leads us to propose a...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i1.25178